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psj ■ Abstract 



In this paper we present a samphng framework for RNA structures of fixed 
topological genus. We introduce a novel, linear time, uniform sampling al- 
gorithm for RNA structures of fixed topological genus g, for arbitrary g > Q. 



Q ■ Furthermore we develop a linear time sampling algorithm for RNA structures 

of fixed topological genus g that are weighted by a simplified, loop-based 
energy functional. For this process the partition function of the energy func- 
J^ \ tional has to be computed once, which has 0{n^) time complexity. 

^^ • Keywords: RNA secondary structure, RNA pseudoknot structure, 

C^^ ■ diagram, topological surface, genus, partition function, samphng 

^' 

. 1. Introduction 

^^ Pseudoknots have long been known as important structural elements in 

^ I RNA [l|. These cross-serial interactions between RNA nucleotides are func- 

^ I tionally important in tRNAs, RNaseP |2|, telomerase RNA |3|, and ribosomal 

RNAs [^ . Pseudoknots in plant virus RNAs mimic tRNA structures, and in 
vitro selection experiments have produced pseudoknotted RNA families that 
bind to the HIV-1 reverse transcriptase [5|. Import general mechanisms, such 
as ribosomal frame shifting, are dependent upon pseudoknots |6|. 
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Lyngs0 et al. |7] have shown that the prediction of general RNA pseudo- 
knot structures is NP-complete. Thus, in order to provide prediction tools 
of feasible time complexity one frequently sticks to subtle subclasses of pseu- 
doknots suitable for the dynamic programming paradigm [8|, |9| . Alternative 
approaches to the prediction of RNA secondary structure (with or without 
pseudoknots) build on random sampling of foldings compatible to a given 
sequence. Here both, the underlying probability model and the efficiency of 
the sampling algorithm are crucial for being successful. 

In this paper we propose a linear time uniform random sampler for 
pseudoknotted RNA structures of given genus which might be considered 
a promising starting point for the design of efficient solutions to the struc- 
ture prediction problem. Our approach is based on the observation that 
pseudoknotted RNAs are in a natural way related to topological surfaces. In 
fact pseudoknotted RNA structures can be viewed as drawings on orientable 
surfaces of genus g, that is by means of the classical classification theorem 
either on the sphere (secondary structures) or connected sums of tori (pseu- 
doknotted structures). Ou r appro ach is a natural evolution from Waterman 
et al. pioneering work [lO|, |ll|, [l2| on secondary structures. 



Secondary structures are coarse grained RNA contact structures, see Fig- 
ure [1] (A). They can be represented as diagrams, i.e. labeled graphs over the 
vertex set [n] = {l,...,n} with vertex degrees < 3, represented by draw- 
ing its vertices on a horizontal line and its arcs {i,j) {i < j), in the upper 
half-plane, see Figure [H We assume the vertices to be connected by the 
edges {i,i + l}, 1 < i < n, which are not considered arcs (but contribute to a 
nodes's degree). Furthermore, vertices and arcs correspond to the nucleotides 
A, G, U and C and Watson-Crick base pairs (A-U, G-C) or wobble base 
pairs (U-G), respectively. 

Considering only the Watson-Crick and wobble base pair RNA structures, 
we set the restriction that one vertex can only paired with at most another 
vertex. Let i < r, we call arcs {i,j) and (r, s) crossing iii<r<j<s 
holds. In this representation a secondary structure is a diagram without 
crossing arcs. Otherwise, i.e. diagrams with crossings represent pseudoknot 
structures, see Figured] (B). 

In this paper, we present a framework for generating diagrams with cross- 
ings, filtered by topological genus, with uniform probabilities. The topologi- 
cal filtration of RNA structures has first been proposed by Penner and Wa- 



terman in [13[ and later, as an application of the Matrix model [IJ], in Il5|. 



The work here however is based on the combinatorial work of Chapuy [l6| 
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Figure 1: A secondary structure (A) and a pseudoknot structure (B) and their 
diagram representation. 

In order to understand how topology enters the picture for RNA molecules 
we need to pass from diagrams or contact-graphs to that of topological sur- 
faces. Only the associated surface carries the key invariants leading to a 
meaningful filtration of RNA structures. The mental picture here is to 
"thicken" the edges into (untwisted) bands and to expand each vertex to 
a disk as shown in Figure |2l This inflation of edges leads to a fatgraph D 



17, 18 



A fatgraph, sometimes also called also a "map" , is a graph equipped with 
a cyclic ordering of the incident half-edges at each vertex. Thus, D refines its 
underlying graph D insofar as it encodes the ordering of the ribbons incident 
on its disks. In fact a fatgraph constitutes to a cell-complex structure - 
combinatorial data in a sense- that have a topological surface as geometric 
realization 19 . 

Our sampling process consists of two steps: first we generate a diagram 
without crossing arcs and second we hft the topological genus to some fixed 
g. The process has linear time and is thereby very efficient. 

The paper is organized as follows: we first introduce the topological filtra- 
tion of diagrams. Then we introduce a genus induction process and finally, 
we describe and analyze the sampling processes. 



2. Some basic facts 

2. 1 . Diagrams 

A diagram is a labeled graph over the vertex set [n] = {l,...,n} in 
which each vertex has degree < 3, represented by drawing its vertices in a 



horizontal line. The backbone of a diagram is the sequence of consecutive 
integers (1, . . . , n) together with the edges {{i, i + l}\l<i<n — 1}. The 
arcs of a diagram, {i,j), where i < j, are drawn in the upper half-plane. We 
shall distinguish backbone edges {z, z + 1} from arcs {i,i + l), which we refer 
to as a 1-arc. Two arcs (z, j), (r, s), where i < r are crossing iii<r<j<s 
holds. The arc (1,"^.) is called rainbow, see Figure [31 





(B) 




Figure 2: From graphs to fatgraphs: (A) A graph with 4 vertexes and 4 edges. 
(B) Inflation of a vertex. (C) A fatgraph derived from (A) induces a topological 
surface. 



2.2. Fatgraphs and unicellular maps 

In this section, we discuss the filtration of diagrams by topological genus. 
In order to extract topological properties of diagrams those need to be en- 
riched to fatgraphs. The latter are tantamount to a cell-conaplex structures 
over topological surfaces. Formally, we make this transition 20| by "thicken- 
ing" the edges of the diagram into (untwisted) bands or ribbons. Furthermore 
each vertex is inflated into a disc as shown in Figure [2] (B). This inflation of 
edges and vertices means to replace a set of incident edges by a sequence of 
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half-edges. This constitutes the fatgraph 

A fatgraph is thus a graph enriched by a cyclic ordering of the incident 
half-edges at each vertex and consists of the following data: a set of half- 
edges, H, cycles of half-edges as vertices and pairs of half-edges as edges. 
Consequently, we have the following definition: 

Definition 1. A fatgraph is a triple {H, a, a), where a is the vertex-permutation 
and a a fixed-point free involution. 

In the following we will deal with orientable fatgraph^. Each ribbon has 



"'^Here ribbons may also be allowed to twist giving rise to possibly non-orientable surfaces 
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Figure 3: A diagram over 50 vertices. The arc (10, 11) is a 1-arc. The arcs (3,22) 
and (12,34) are crossing. The dashed arc (1,50) is the rainbow. 

two boundaries. The first one in counterclockwise order shall be labeled by 
an arrowhead, see Figure |2] (C). 

A fatgraph D exhibits a phenomenon, not present in its underlying graph 
D. Namely, one can follow the (directed) sides of the ribbons rotating coun- 
terclockwise around the vertices. This gives rise to D-cycles or boundary 
components, constructed by following these directed boundaries from disc to 
disc. Algebraically, this amounts to form the permutation 7 = a o a. 

In the following we consider only diagrams with rainbow. As we shall 
see, the rainbow arc provides a canonical first boundary component, which 
travels on top of the rainbow arc and around the backbone of the diagram, 
see Figure m 
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Figure 4: (A) A diagram. (B) the fattening of (A) augmented by the rainbow (0, 7). 
Here a = (0,1,2,3,4,5,6,7), a = (0, 7)(1,4)(2, 5)(3, 6). Accordingly 7 = 000- = 
(0,4, 2, 6)(1, 5,3)(7) has two cycles. (C) Collapsing the backbone into a vertex. 



A fatgraph, D, can be viewed as a "drawing" on a certain topological 
surface. D is a 2-dimensional cell-complex over its geometric realization, 
i.e. a surface without boundary, Xjo, realized by identifying all pairs of edges 



19| . Key invariants of the latter, like Euler characteristic 19 



x(Xd) = v-e + r, (1) 

9{Xn) = l-^xiXo), (2) 

where v, e, r denotes the number of discs, ribbons and boundary components 



in ro [19| are defined combinatorially. However, equivalence of simplicial and 



singular homology [21| implies that these combinatorial invariants are in fact 
invariants of X© and thus topological. This means the surface Xp provides 
a topological filtration of fatgraphs. 

Since, adding a rainbow or collapsing the backbone of a diagram does 
not change the Euler characteristic, the relation between genus and number 
of boundary components is solely determined by the number of arcs in the 
upper half-plane: 

2-2g-r = l-n, (3) 

where n is number of arcs and r the number of boundary components. The 
latter can be computed easily and allows us therefore to obtain the genus of 
the diagram. 

Definition 2. A unicellular map m of size n is a fatgraph xn{n) = {H, a, a) 
in which the permutation a o a is a cycle of length 2n. 

While unicellular maps are simply particular fatgraphs, they naturally 
arise in the context of diagrams, by two observations. First in the diagram 
one may collapse the backbone into a single vertex. Second the mapping 

71 : {H,a,a) t-)- {H,a o a,a), 

is evidently a bijection between fatgraphs having one vertex and unicellular 
maps, see Figure El The mapping is called the Poincare dual and inter- 
changes boundary components by vertices, preserving topological genus. In 
the following, we use vr to denote the Poincare dual. 

Given a unicellular map the permutation a and 7 induces two linear 
orders of half-edges 

r <^< 7(r) <^ ■ ■ ■ <^ l^"~\r), r <^< a{r) <^ ■ ■ • <^ a'^(r). 

Let ai and 02 be two distinct half-edges in m. Then ai <^ 02 expresses 
the fact that ai appears before 02 in the boundary component j = a o a. 
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Figure 5: The Poincare dual: we map a fatgraph with 1 vertex and 3 boundary 
components into a fatgraph with 3 vertexes and 1 boundary component. 



Suppose two half-edges ai and a2 belong to the same vertex v. Note that 
V is effectively a cycle which we assume to originate with the first half-edge 
along which one enters v traveling 7. Then ai <„ 02 expresses the fact that 
ai appears (counterclockwise) before 02- 

The Poincare-dual maps the rainbow into a distinguished vertex of degree 
one and provides thereby a natural origin for the cycle 7. We call this vertex 
the plant, see Figure [5l Given a unicellular map we call a half-edge the 
minimum half-edge of a vertex v if it is the first half-edge via which 7 visits 

V. 

2.3. Genus induction 



In this section we present a construction of IGj, which plays a key role 
for our main result. It consists of two processes: a slicing- map S and a 
gluing-map A, which, when restricted to the proper classes, are inverse to 
each other. 

The slicing process splits a vertex into {2g + 1) vertices and thereby 
reduces the genus of the map by g. Gluing is effectively inverse to shcing, 
namely: gluing any {2g + 1) vertices in a unicellular map increases the genus 
of the map by g. Slicing and gluing preserve unicellularity. 

Definition 3. A half-edge h is an up-step if h <^ cr(/i), and a down-step if 
o"(/^) <7 h. h is called a trisection if /i is a down-step and cr{h) is not the 
minimum half-edge of its respective vertex. 

The number of trisections in a unicellular map of genus g is given by the 
following lemma: 

Lemma 1. [16] Let ra be a unicellular map of genus g. Then m has exactly 
2g trisections. 



Given a unicellular map m and a vertex v together with a trisection r 
contained in v. Let ai be the minimum half-edge of v. Then set 03 = a(r) 
and 02 to be the smallest half-edge between ai and a^ (with respect to the 
order <-^) such that 02 >7. 

Since r is a trisection such an a2 exists. Then we refer to the replacement 
of 

by the three vertices where Vi = {ai, hi,^ , • • • , h"^'), ^ = 1,2, 3, see Figure |6l 
as slicing. Slicing produces the unicellular fatgraph rn = {H,W, a). 
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Figure 6: Illustration of gluing and slicing in a unicellular map. 

Conversely, let m be a unicellular map and let ai, 02 and a^ be three half- 
edges belonging to three distinct vertices, Vi = {ai,hi,^ , . . . ,h™'^) for some 
rrii > and i = 1,2, 3. Furthermore suppose ai <^ 02 <7 03. 

Then, replacing the cycles vi, V2 and v^ by the cycle 

v = {a,,h\,...,hr\a2,h\,...,hT,a,,h\,...,K^), 

is referred to as gluing. Gluing produces the unicellular fatgraph m = 
{H,W,a), see Figure |6l in which the half-edge ^"^(03) is, by construction, a 
trisection. 



Lemma 2. fldj Slicing maps a unicellular map together with a trisection 
into a unicellular map together with three labeled vertices. Gluing maps a 
unicellular map together with three labeled vertices into a unicellular maps 
with a trisection. 

Suppose we shce (m, r) into (m, Wi, f2, "^3), where in rn holds oi <- 03 <- 
02. Then we observe that in m ai remains minimum in its new vertex and so 

8 



does 02, because 02 is by definition the minimum half-edge where 03 <y 02. 
However, 03 becomes either the minimum half-edge, or remains a half-edge 
following a trisection. This gives rise to two types of trisections: 

Definition 4. Let in be a unicellular map and v a vertex containing a tri- 
section T. Shcing (m, t) we obtain {m,vi,V2,vs). If the minimum half-edge 
of f3, denoted by 03 is the half-edge a(r) in rn, we call the trisection r to be 
of type I and type II, otherwise. 

Proposition 1. Let trig denote a unicellular map of genus g having n edges. 
Let furthermore t^ denote a trisection of type I and r^^ denote a trisection 
of type II. Then we have the mappings $ and \l/; 



$(mg, wi, t;2, v^) = (mg+i, r^), ^(m^, vi, V2, r) = (m^+i, t^^) 

are bijections, where Vi, V2 and v^ denote three distinct vertices in m^ and 
TTig+i is a unicellular map of genus g + 1 having n edges. 

Here $ generates the trisection r^ in a unicellular map of genus g + l and 
the trisection r^^ persists when applying the mapping \l/. 

Gluing can be described as follow: 
Given a unicellular map of viVg-k, together with a sequence of vertices V = 
{f 1, . . . V2k+i}, where Vi <^ fi+i, VI < i < 2A; + 1. Then: 

I. we glue the last three vertices V2k-i, V2k and V2k+i via $, thereby obtaining 
the unicellular map xrig-k+i together with trisection r^. 

II. we apply ^(m^^fc+j, V2k~2i~i, V2k-2i, t^) k-1 times for i = 1 to i = /c - 1. 
This produces the unicellular map mg{n), together with a trisection r^^. The 
process defines a mapping 

A(mg_fc, ^^1, . . . , V2k+i) = (mg, r), 

where we do not label r by type since in general we do not know whether \l/ 
has been applied. The order of the vertices in V is given by the partial order 
determined by 7. Thus V can be considered as a set of vertices in xng^k, 
ordered by < 7. A merges vertices from right to left by first applying $ once 
then applying \1' several times. 

A is reversed as follows: given a unicellular map m^ of genus g and i = 0: 
1. if r is type II trisection in mg_j, then let (mc,_j_i,f2j+i,f2i+2, t) = 
\I'^^(mg_j, r). We increase i to i + 1 and repeat step 1. 



3. if r has type I, let img_i,V2i+i,V2i+2,V2i+3) = * Hmg_i_i,r). 
Then we return 

S(m3,r) = (m3_i,K). 

By construction, A and S are inverse to each other. 



Theorem 1. 11 d / Let Ug denote the set of tuples {mg,vi, . . . ,Vt), where 
vi,...,Vt is a sequence of vertices in xxig. Furthermore, let Dg denote the 
set of tuples (rrig,r), where t is a trisection ofvcig. Then 

are bijections and A o H = id and S o A = id. 

Let eg{n) denote the number of unicellular map of genus g having n edges. 
Then we have the following enumerative corollary 

Corollary 1. 

/n + 1 - 2((7 - 1)\ , ^ f n + l\ , ^ 

2g-eg{n)=l^ ^^' ^ je,_iH + ■ ■ ■ + (^^^ ^ J^oH- (4) 

Here the 2g-iactoT on left hand side counts the number of trisection in 
trig and the binomial coefficients on the right hand side counts the number 
of distinct selections of subsets of {2k + 1) vertices from a unicellular map 

Iterating S, we obtain 



egW 



o=ao<g 



S._niG(;-.""0-^°'"'' '^' 



where eo(^) is the number of planar trees having n edges, i.e. the Catalan 
number -^( "). 

3. Uniform generation of matchings 

In this section, we show how to generate a matching of a given genus g 
over 2n vertices with uniform probability. 

Any unicellular map m^ together with one of its 2g trisections is mapped 
via S into a unicellular map of lower genus. Note that the genus decreases 

10 



at least by one. Therefore, by iterating the process finitely many times (at 
most g), we arrive at a unicellular map of genus 0, i.e a planar tree. 

For our construction it is important to keep track of the particular slicing 
process. Accordingly, we introduce slice/glue paths as follows. 

Definition 5. Suppose trig is a unicellular map of genus g having n edges. 
Then a sequence unicellular maps 

(m = mgo=o, m = trig,, . . . , m'' = mg^=g) 

is called a slice path from m^ to mo and a glue path when considered from mo 
to m^, where H(mg., Tj) = (mg._i, Vg._J holds for some Tj in m^., < i < r. 

We next consider Pg{m^), the set of distinct glue paths from a given 
m° = mo to some unicellular maps of fixed genus g. 

Lemma 3. The cardinality of Pg{xn^) is given by 

^^2gA2(g,-gi_i) + lJ' 

0=gQ<gi<-<gr=g i=l ^* ^ ^^* ^i U ' / 

Proof. In order to construct m^. from m^.^, < i < r, we need to select 
2{gi — gi-i) + 1 vertices from mg^_-^. 

Euler characteristic shows that there are (n + 1 — 2gi-i) distinct vertices 
in mg._^, whence there are (2r^l~^^N+i) ways to select a subset of vertices 

Vg..,. 

On the other hand, the mapping A produces m^^ with a labeled trisection 
Tj, i.e., the same m^^ will be produced exactly 2gi times. Accordingly, we 
need to normalize the production by a factor l/2gi for each application of A. 

As a result the total number of glue paths in Pc,(m°) is 

^ ^^2gi\2(g,-gi^i) + lJ' 

0=go<gi<-<gr=g i=l ^' ^ ^^* tli ij < / 

which is exactly the coefficient of eo(n) in eq. ([S]). □ 

The problem of generating a unicellular map of genus g having n edges 
with uniform probability thus splits into two parts: we first generate a pla- 
nar tree mo with n edges with uniform probability. Second we generate a 
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glue path from Pg{xn^) with uniform probabihty. It is well-known how to 



implement the first step by a linear time (rejection) sampler 22(| and it thus 
remains to present an algorithm for the second step. 

We construct a glue path inductively. Suppose we are at step i and we 
have constructed a unicellular map m* of genus gi. Then the next genus f^j+i 
is suggested by the process NextGenus. This process considers the sequence 
of genus go, . . . ,gi and the target genus g as input, and returns the genus 
(yfj+i. Let P(5'i+i = t I go, . . . ,gi,g) denote the probability of gi+i equals t 
under the condition that go, . . . ,gi are the genus of the previous steps and g 
is the target genus. Then 

v^ -f] J_( n+l-2ti_i \ 

^ 11 2ti \2{ti-ti-i)+l) 

Tn,/ , I \ to=go,...,ti=9i,ai+i=t<ti+i<---<tr=gt=l 

r{gi+i = t\go,...,gi,g) = . 

^Z^ ^ 11 2t, \2{U-U^i)+l> 

to=go,...,ti=gi<---<tr=g i=l 

(6) 
Next we select the sequence of vertices from m* by process SelectVertex. 
This process chooses vertices in 2{gi^i — gi) + 1 independent steps. The 
probability of a vertex being selected is given by l/{n + 2 — 2gi — k), where 
{n + 2 — 2gi — k) is the number of remaining non-selected vertices in the 
kth step, 1 < A; < 2{gi^i — gi) + 1. Since the selected vertices are ordered 
automatically by <^ ., the same set is generated with multiplicity (2((yfj+i — 
gi) + 1)!. Normalizing the resulting term by the factor l/(2((7i+i — gi) + 1)!, 
the probability of the set V^, < i < r is given by 

""^^^^ (2(,.,, -,.) + !)! n+l-2g. n - 2g^,, U^SVi) ' 

, (7) 
After the sequence of vertices Vi is selected, a unicellular map m*+^ is 
constructed by the process Glue, applying mapping A. We present the pseu- 
docode of the procedures in Algorithm [TJ 

Assuming the target genus to be constant and taking into account that 
during our construction the genus is strictly increasing, the while-loop of Al- 
gorithm 1 is executed only a constant number of times. Using appropriate 
memorization techniques, NextGenus and Glue can be implemented in con- 
stant time and SelectVertex in linear time. Thus, combined with a linear 
time sampler for planar trees, our approach allows for the uniform generation 
of random matchings in time 0{n). 
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Algorithm 1 



Unif ormMatching {m^,TargetGenus) 

while Qi < TargetGenus do 

Qij^-i ^ NextGenus [qq, . . . , Qi, TargetGenus) 

Vi -h- SelectVertex (m*, 2{gi+i — g^) + 1) 

m^+^ ^Glue {m\Vi} 

i ^ i + 1 
end while 
return m* 



Lemma 4. Given a planar tree m° with n edges and a genus g, the probability 
of a glue path Pg generated by AlgorithmUl is eo(n)/ec,(n). 

Proof. Assume a glue path 

Pg = {m° = mgg=o, m^ = nxg, , . . . , m'' = mg,,=g} 

is generated by Algorithm [H Since for each step, the process of choosing 
the genus for the next step and selecting labeled vertices is independent, the 
probability of Pg is given by 

r-l 
i=0 
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We substitute eq. (^ and eq. ([7]) and obtain 



r-l 



HP,) = n 



Z^to=go,--;ti=gi,gi+i=t<ti+i<-<tr=g i. i.i=l 2ti \2{ti-ti_i)+lJ 



4=0 Z^to=30,-,t»=ft<-<tr=g llj=l 2ti\2{ti-U.i)+l) \2{gi+i~gi)+l) 

nj_/ n+l-2ii_i \ r-l 
_ ti=gi,...,tr=gr 2ti \2(ti-ti.^)+l) -TT i 

~ v^ T-rr J_ / n+l-2ti_i \ '11/' n+l-2gi \ 

Z^{]=to<...<U=gi.U=l2ti\2{ti~U-l)+l) j=0 l2(gi+i-9i)+l/ 

1 

~ y^ YY J_/ n+l-2ti_i \ 

Z^O=to<---<ir=g lli=l 2ti V2(ti-i,_i)+l/ 

^ ^o{n) 

^Ol""-) ■ 2^0=to<--«r=9 11*=1 2ri\2{ti-ti-l)+l) 

whence the lemma. D 

Corollary 2. Suppose a planar tree m° is uniformly generated, i.e., with 
probability l/eo{n). Then a unicellular map m'' = trig is uniformly generated 
by AlgorithmUl with probability l/eg{n). 

4. Uniform generation of diagrams 

In this section, we extend our result of Section [3] in order to generate 
diagrams of genus g with uniform probability. The idea is to uniformly 
generate first a matching of genus g with n arcs. In a second step we choose 
(£ — 2n) unpaired vertices and insert them into the matching. 

Let Pd(t = n\l., g) denote the probability of the diagram having exactly n 
arcs, < n < [£/2j. In the following we compute Prf(t = n\l^g). 

Let 5g{E) denote the number of diagrams of genus g over C vertices. Fur- 
thermore, let 5g{^, n) denote the number of diagrams of genus g over i vertices 
having exactly n arcs, 2n < L Then i — 2n vertices are unpaired and 



5gii,n) 



-2n 
Furthermore 



eg[n). 



U/2J [l/2\ / « X 



n=0 n=0 
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In order to generate a diagram of genus g over i arcs uniformly, we need to 
solve 

whence Prf(t = n\i,g) = Sg{i,n)/Sg{i). 

We present the pseudocode of Unif ormDiagram as Algorithm O The 
subroutine NumberofArcs returns n with probability Prf(t = n\i,g), which 
determines the number of arcs in diagram Dg{i). Unif omTree is a standard 
process uniformly generating a matching of genus with n arcs. Finally, 
the process InsertUnpairedVertices first chooses {i — 2n) vertices from £ 
vertices as unpaired. It leaves 2n vertices not selected, which are consid- 
ered to be paired. Then the process maps the 2n vertices of the matching 
generated by Unif ormMatching and keeps the arcs in the upper half-plane. 
Accordingly, a diagram of genus g over i vertices with exactly n arcs is gen- 
erated. The result of some experiments conducted in connection with the 
generation of random matchings and diagrams using our algorithms is shown 
in Figure [71 

Algorithm 2 



Unif ormDiagram (i^TargetGenus) 

n <r- Numberof Arcs(£, f^) 

m-o ^ Unif omTree (n) 

rrip ■(— Unif ormMatching(m, TargfetGeriMs) 

Dg -h- InsertUnpairedVertices (trig, £) 

return D„ 



5. Non-uniform sampling 

RNA structures can be represented as diagrams and are, due to the bio- 
physical context subject to certain constraints with respect to their free en- 



ergy 23|. The latter energy is oftentimes modeled as a function of the loops 



of the underlying RNA structure [23[, S. These loops are in fact equal to the 
boundary components of the fatgraph constructed from the molecule. In the 
following we shall discuss, rj{S), a simplified version of the actual bio-physical 
loop-energy of a structure S. 

Let us start with RNA secondary structures, that correspond to diagrams 
of genus 0. For a secondary structure 5*0, we denote its corresponding (see 
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Figure 7: Uniform generation: (A) matchings, n = 12, g = 2 and e2(n/2) = 6468. 
We generate A'^ = 10^ matchings and display the frequencies of their multiphc- 
ities (blue dots) together with the binomial coefficient of the uniform sampling 
(^)(l/e2(n/2))^(l - l/e2(n/2))^-^ (red). (B) The analog of (A) for diagrams. 
Here we have n = 12, g = 2 and 52{n) = 48741. We generate N = 10^ diagrams 
and display the frequencies of their multiplicities (blue dots) together with the 
binomial coefficients (^)(1/(^2(^))^(1 — ^/^2{n))^~^ (red). 



Section El duality mapping vr) unicellular map by mo = 7r(S'o). The bonds 
or arcs of the structure then correspond to edges of the unicellular map 
rrio and loops or boundary components to vertices. Three types of loops 
are distinguished: hairpin loops, interior loops (including helices and bludge 
loops) and multi-loops. Accordingly, the duality maps hairpin loops into 
vertices of degree one, interior loops into vertices of degree two, and multi- 
loop into vertices of degree greater than two, see Figure El '7(5') extends 
these types in order to deal with structures having arbitrary genus (? > as 
follows. 

Let Sg denote an RNA structure having length £, n arcs and genus g, and 
rxig = 7r(5'g) its corresponding unicellular map. Then r]{Sg) is given by 



v{Sg)=n.b + J2T{v) + Lf. 



(9) 



v£¥ 



Here b represents an energy contribution of arcs, V the set of all vertices m 
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Figure 8: Hairpin loops, interior loops and multi-loops in secondary structures and 
their translation into unicellular maps. 



T{v) is function given by 



T{v) 



'L^P ifrf(f) = l, 
L*"* if d{v) = 2, 

j^mui j£ ^^^^ ^ 2 and v has no trisection, 
if f is attached to the root, 



where d{v) is the degree of vertex v, and L^ is the contribution of a loop 
of type X, where X = {hp,int,mul}. Finally, L^'^ represents a contribution 
that stems from novel loop-types emerging for genus g > 0. In this model, 
we do not take contributions from unpaired vertices into account. 



In case oi g = 1, there are four different types of pseudoknots [2J 



see 



Figure [H This is analogous for any genus: there are always only finitely many 



corresponding shadows |25l. |24|| . see Figured Here, a shadow is a diagram 
without unpaired vertices in which all stacks (parallel arcs) have size one. 
Formally, a shadow of a structure can be obtained by first removing all its 
unpaired vertices, second removing all noncrossing arcs (together with their 
vertices) and then replacing a set of parallel arcs (and the incident vertices) 
of the form {(i, j), {i + l,j — l),...,{i + i — l,j—i+ 1)} by a single arc (and 
two vertices). 

Let us have a closer look at the boundary components of these shadows in 
case of genus 1. (H) is inspected to have one boundary component, whence 
?7(S'h) = 26 + L™"' + L^ . (K) and (L) have two boundary components and 
accordingly tjISk) = vi^h) = 36 + 2L™'"' + Lf . Finally, for (M) we have 
r/(5M)=46 + 3L--' + Lf. 

Consider a matching Si of genus 1 having n arcs and mi = vr(S'i) the 
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unicellular map given by the duality. By selecting a trisection r in mi and 
applying the mapping S, we obtain H(mi,r) = (mo,fi,f2,f3) and three la- 
beled vertices. Here we write itIq = (11x0,^1,^2,^3) for short. Let ^q de- 
note a secondary structure with three labeled boundary component where 
7r(S'Q ) = trio . Let further §q^ and Si^„ denote the set of Si and Sq respec- 
tively. By Lemma [H there are two trisections in mi. Therefore, by selecting 
the same 5*1 and different trisection r, S results in different Sq . Thus we 
have the cardinality 2|§i^„| = |Sq„|. Figure |9] shows this for the four shad- 
ows of genus 1 and their secondary structure with three labeled boundary 
components. 

(H) (K) 



1 







Figure 9: Correspondence between shadows of genus one and their labeled sec- 
ondary structures. 

We next formulate an "energy" for structures S^ , 7]{Sq ), that matches 
the energy rj for their corresponding counterpart of genus one after gluing. 
Note that this allows us to reduce everything to secondary structures with 
three labeled boundary components. To this end, let vi, V2 and V3 be three 
labeled vertices in mj) , where mQ = 7r(S'Q ). Setting T{vi) = T{v2) = 
T{v3) = {Lf + L"^"')/3 we observe 

Proposition 2. We have ri{Si) = ri{SQ ). 

Proof. The mapping S is a bijection and A(mo,fi,f2,f3) = (mi,r). The 
three labeled vertices in mo are glued as v, where d{v) > 3. Hence T{v) = 
j^mui ^ ^Pk ^ y^^^^ ^ y^^^^i _^ r^^^^-^^ because T{vi) = T{v2) = T{vs) = 

(L^ + L™'"')/3. The other vertices in mo maintain hence their scores are not 
changed. D 
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Given ri{S) we proceed along the lines of [26j and construct a proba- 
bility space of structures by computing the partition function of a given 
sequence. Let 9{n) = XIsgs ^^^^^ denote the total energy of all structures. 
A structure, S, is sampled with probability e'^^^yO{n). In case of secondary 
structures, loop-based and arc-based energy models are compatible to the 



standard recursion of secondary structure [27| and Oo[n) can be computed 
by the recursion 

n— 2 n— 3 

e{n) = J2 0{^e{n~i-l)+e^"'+^-e{l)+e^'"'+^-e{n-l)+e^"''''+^-J2 0{i)0{n~i-2), 

i=l i=l 

where t] is the energy functional, discussed above. As there is only one 
summation in the above recursion, 6{n) is computed in O(n^) time. 

We proceed by showing that the new functional, tj^Sq ), is also compat- 
ible with the secondary structure recursions. 

Lemma 5. Let ^i(n) = EseSi „ ^''^^^ and e^o\n) = E5g§(3) e''^^^- Then 

9i{n) = 9q {n)/2 can he computed in 0{n^) time. Once 6i{n) is computed, 
a structure of genus one, Si, is sampled with probability e^^^'^^ / 9i{n) in 0{n) 
time. 

Proof. We have r]{Si) = v{So^^) for all Si G §i,„ and S^^^ G SjIJi, and 2|§i,„| = 
|§onl- Therefore, 

We next show that 9q (n) can be computed in 0{n'^) time. Let 9q = 
y]c.^c,(2) e^^^'^ and 6'i = y]c^s(i) e'''^'^^ where Si^ and Sir, denote the sets of 
secondary structures with two and one labeled boundary components. The 
functional of these labeled boundary components are computed exactly as 



in the case of 5*0 



(3) 
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Then we have, see also Figure [TUl 
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(3) 
Figure 10: The recursion for 9q . 

Analogously, we have recursions for 9q (n) and 9q (n), which can be 
computed in O(n^) time. Therefore, ^i(n) = 9q (n) can be computed in 
O(ra^) time. 

In order to sample a diagram of genus 1 over i vertices, Di, we need first 
to determine its number of arcs. As in the case of uniform sampling, we 
have -^lii) = X]n=o ^ii^i^)i where ■&i{i,n) = G_2n)^i('^)- Replacing in the 
formulae for uniform sampling ei(ra) by 9i{n) and 6i{n) by '&i{n), we find that 
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the probability of sampling a diagram with n arcs is given by 'i9i(£, n)/di{t). 
It remains to sample a matching Sq with n arcs and to subsequently glue 
the three labeled vertices in hTq = vr(S'o ). This generates a unicellular maps 
of genus one, mi, which is associated to 5*1 = 7r^^(mi) by duality. Note that 
choosing different slice-paths for Si generates two different 5*0 , see eq. flTO|) . 



S'^' ^Si 

(10) 

{mQ,vi,V2,v^)- — ^(mi,r) 
The probability of a structure of genus one, S'l, is then given by 

Finally, we insert the unpaired vertices into 5*1 and obtain Di with the prob- 
ability 

^i(£) di{n) ^iW 

D 

6. Conclusion 

In this paper we have proposed an original and highly efficient (linear 
time) approach to sample random RNA pseudoknotted structures in the 
uniform and a non-uniform model. The later builds on a simplified concept 
of free energy, favoring foldings of a native appearance. This is a first step to- 
wards efficient prediction algorithms for pseudoknotted RNA since structure 
predictions of good quality can easily be derived from suitable (high quality) 



random samples (see 28j and the references given there). To this end, our 



algorithms need to be extended towards two directions: 

1. The probability model needs to be improved further, and 

2. the RNA sequence needs to be taken into account. 

An immediate application of the uniform sampler are the distributions of 
loops in structures of genus g. We have shown that the loops in structures are 
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translated into vertices of their associated unicellular maps. In particular, 
a hairpin loop corresponds to a vertex of degree one, an interior loop to a 
vertex of degree two and a multi-loop is to some vertex having degree greater 
than two without a trisection. Finally a pseudoknot loop corresponds to a 
vertex having degree greater than two containing a trisection. In Fig. [11] we 
present the respective data, filtered by genus. 



D 


R-0 


+ 


g-i 


<> 


S-2 


* 


E-3 


O 


B-4 


X 


B-5 



^SBB S R HiB i B iaa B i lllBIB i 



10 12 14 16 18 20 

length 



0.009 

0.008 

0.007 

0.006 

0.005 

0.004 

0.003 - 

0.002 

0.001 - 



Ox 




10 15 20 25 

length 



Figure 11: Loops in uniformly generated, genus filtered, RNA structures. Left: 
distribution of standard loops, where x-axis is the length of boundary component 
and y-axis is frequency. Right: distribution of pseudoknot loops. 

It is well known in context of pseudoknot-free secondary structures how 
to use either a sophisticated model for the free energy or stochastic concepts 
like the maximum likelihood approach to obtain realistic probability models 
applicable to random sampling. Our approach seems to be suitable to ap- 
ply the latter and it is a topic for future research to work out the details. 
Incorporating the sequence is a more complicated task but again results for 
classic RNA secondary structures prove it feasible with only small losses in 



efficiency [29 



Thus we assume our findings of this paper an important contribution 
towards the development of efficient structure prediction tools for pseudo- 
knotted RNA structures. Those are also in need for state of the art tools 
addressing the inverse folding problem. The latter quite often use some 
search heuristic (like e.g. a genetic algorithm) to process the space of possi- 
ble sequences using structure prediction tools to judge the quality (similarity 
to input) of current solutions. For the large number of calls, the efficiency 
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of the prediction algorithm is crucial for the applicability of the entire ap- 
proach. Today's established algorithms for the prediction of pseudoknotted 
RNA with run times in 0{n^) or worth (see [9j) seem not to be appropriate. 
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